class: title-slide, right, top background-image: url(data:image/png;base64,#img/hex_forcats.png), url(img/axsome_logo.png) background-position: 93% 63%, 50% 50% background-size: 10%, 40%
.right-column[ # Module 12: Factors with Forcats ### **Graham Eglit**<br> Axsome Therapeutics<br> Fall 2024 ] --- class: inverse, center, middle # Factors Basics! ---- <svg viewBox="0 0 581 512" style="position:relative;display:inline-block;top:.1em;fill:white;height:3em;" xmlns="http://www.w3.org/2000/svg"> <path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> --- .center[ # Factors ] .pull-left[ - Factors are used to work with categorical variables - Categorical variables = variables with a fixed and known set of possible values - e.g., eye color, treatment group, etc. 
<br> <br> - We've come across factors before back in Module 2 - Recall that factors are integers that look like characters - Ideal for statistical modeling - We'll also use them for ordering character strings - e.g., for visualization ] .pull-right[ <br> <img src = "data:image/png;base64,#https://media.giphy.com/media/ibAtCaoRZHNFKP9te0/giphy.gif" /> .center[ .caption[ Via [Giphy](https://media.giphy.com/media/ibAtCaoRZHNFKP9te0/giphy.gif) ] ] ] --- .center[ # Creating Factors ] - Use the `factor` function to create factors - Use the `levels` argument to specify levels of the factor - Use the `labels` argument to change the labels associated with each level of a factor ```r x <- c("Tues", "Sat", "Thur") x ``` ``` ## [1] "Tues" "Sat" "Thur" ``` ```r x <- factor(x, levels = c("Sun", "Mon", "Tues", "Wed", "Thur", "Fri", "Sat"), labels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")) x ``` ``` ## [1] Tuesday Saturday Thursday ## Levels: Sunday Monday Tuesday Wednesday Thursday Friday Saturday ``` --- class: inverse, center, middle # General Social Survey ---- <svg viewBox="0 0 581 512" style="position:relative;display:inline-block;top:.1em;fill:white;height:3em;" xmlns="http://www.w3.org/2000/svg"> <path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> --- .center[ # The General Social Survey: `gss_cat` ]
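
- A minimal sketch for a first look at `gss_cat`, assuming the `forcats` and `dplyr` packages are installed (`gss_cat` ships with `forcats`)
- The choice of `marital` and the name `marital_counts` are only for illustration; any factor column would work the same way

```r
library(dplyr)
library(forcats)

# Overview of the columns and their types; several are stored as factors
glimpse(gss_cat)

# The levels of a factor column, in their stored order
levels(gss_cat$marital)

# How many respondents fall into each level
marital_counts <- gss_cat %>%
  count(marital)
marital_counts
```

- Checking `levels()` and `count()` first shows both the stored level order and the group sizes, which are the two things the `fct_*` functions in the following slides modify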
--- class: inverse, center, middle # Modifying Factor Order ---- <svg viewBox="0 0 581 512" style="position:relative;display:inline-block;top:.1em;fill:white;height:3em;" xmlns="http://www.w3.org/2000/svg"> <path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> --- .center[ # fct_reorder ] .panelset[ .panel[.panel-name[Overview] - all `forcats` functions begin with `fct_*` - easy to use with function completion <br> <br> - `fct_reorder` re-orders levels of a factor - arguments - `f` = factors whose levels you want to modify - `x` = a numeric vector that you want to use to reorder the levels - `fun` = optional, a function that's used if there are multiple values of `x` for each value of `f` ]<!----> .panel[.panel-name[Original Visualization] .pull-left[ ```r relig_summary <- gss_cat %>% group_by(relig) %>% summarise( age = mean(age, na.rm = TRUE), tvhours = mean(tvhours, na.rm = TRUE), n = n() ) ggplot(relig_summary, aes(tvhours, relig)) + geom_point() ``` ] .pull-right[ <!-- --> ] ]<!----> .panel[.panel-name[Re-Ordered Visualization] .pull-left[ ```r relig_summary %>% mutate(relig = fct_reorder(relig, tvhours)) %>% ggplot(aes(tvhours, relig)) + geom_point() ``` ] .pull-right[ <!-- --> ] ]<!----> ]<!--end panelset--> --- .center[ # How Could we Improve This Plot? ] .panelset[ .panel[.panel-name[Ideas?] 
.center[ <!-- --> ] ]<!----> .panel[.panel-name[Option 1] .pull-left[ ```r relig_summary %>% mutate(relig = fct_reorder(relig, tvhours)) %>% ggplot(aes(tvhours, relig)) + geom_col(aes(fill = relig)) + geom_text(aes(label = round(tvhours, 2)), hjust = 1.2, color = "white", fontface = 2) + theme_light() + labs(x = "Hours of TV Consumption", y = NULL, fill = "Religious Affiliation") ``` ] .pull-right[ <!-- --> ] ]<!----> .panel[.panel-name[Option 2] .pull-left[ ```r relig_summary %>% mutate(relig = fct_reorder(relig, tvhours)) %>% ggplot(aes(x = tvhours, y = relig)) + geom_segment(aes(x = 0, xend = tvhours, y = relig, yend = relig)) + geom_point() + scale_x_continuous(expand = c(0, 0, 0, .5)) + labs(x = "Hours of TV Consumption", y = NULL, fill = "Religious Affiliation") + theme_light() ``` ] .pull-right[ <!-- --> ] ]<!----> .panel[.panel-name[Option 3] .pull-left[ ```r relig_summary %>% mutate(relig = fct_reorder(relig, tvhours)) %>% ggplot(aes(x = tvhours, y = relig, label = round(tvhours, 2))) + geom_segment(aes(x = 0, xend = tvhours, y = relig, yend = relig)) + geom_label() + scale_x_continuous(expand = c(0, 0, 0, .5)) + labs(x = "Hours of TV Consumption", y = NULL, fill = "Religious Affiliation") + theme_light() ``` ] .pull-right[ <!-- --> ] ]<!----> ]<!--end panelset--> --- .center[ # fct_relevel ] .panelset[ .panel[.panel-name[Overview] - `fct_relevel` re-levels a factor - arguments - `f` = factor whose levels you want to modify - `...` = any number of levels that you want to move to the front of the line <br> <br> - `fct_reorder` changes the order of factor levels using a numeric vector, summarized by a function (e.g., mean) when a level has several values - `fct_relevel` changes the order of factor levels through manual re-specification ]<!----> .panel[.panel-name[Original Visualization] .pull-left[ ```r rincome_summary <- gss_cat %>% group_by(rincome) %>% summarise( age = mean(age, na.rm = TRUE), tvhours = mean(tvhours, na.rm = TRUE), n = n()) ggplot(rincome_summary, aes(age, rincome)) + 
geom_point() ``` ] .pull-right[ <!-- --> ] ]<!----> .panel[.panel-name[Re-Ordered Visualization] .pull-left[ ```r ggplot(rincome_summary, aes(age, fct_relevel(rincome, "Not applicable"))) + geom_point() ``` ] .pull-right[ <!-- --> ] ]<!----> ]<!--end panelset--> --- .center[ # fct_reorder2 ] - `fct_reorder2` reorders the factor by the y values associated with the largest x values .panelset[ .panel[.panel-name[Create Data] .pull-left[ ```r by_age <- gss_cat %>% filter(!is.na(age)) %>% count(age, marital) %>% group_by(age) %>% mutate(prop = n / sum(n)) ``` ] .pull-right[ ```r by_age %>% head(n = 5) ``` ``` ## # A tibble: 5 × 4 ## # Groups: age [2] ## age marital n prop ## <int> <fct> <int> <dbl> ## 1 18 Never married 89 0.978 ## 2 18 Married 2 0.0220 ## 3 19 Never married 234 0.940 ## 4 19 Divorced 3 0.0120 ## 5 19 Widowed 1 0.00402 ``` ] ]<!----> .panel[.panel-name[Figures] .pull-left[ **Original** ```r ggplot(by_age, aes(age, prop, colour = marital)) + geom_line(na.rm = TRUE) ``` <!-- --> ] .pull-right[ **Re-Ordered** ```r ggplot(by_age, aes(age, prop, colour = fct_reorder2(marital, age, prop))) + geom_line() + labs(color = "marital") ``` <!-- --> ] ]<!----> ]<!--end panelset--> --- .center[ # fct_infreq ] - `fct_infreq()` orders factor levels by decreasing frequency (most common level first) - Useful in bar plots - Combine with `fct_rev()` to order by increasing frequency .panelset[ .panel[.panel-name[fct_infreq] .pull-left[ ```r gss_cat %>% mutate(marital = fct_infreq(marital)) %>% ggplot(aes(marital)) + geom_bar() ``` ] .pull-right[ <!-- --> ] ]<!----> .panel[.panel-name[combine w/ fct_rev] .pull-left[ ```r gss_cat %>% mutate(marital = fct_rev(fct_infreq(marital))) %>% ggplot(aes(marital)) + geom_bar() ``` ] .pull-right[ <!-- --> ] ]<!----> ]<!--end panelset--> --- class: inverse, center, middle # Modifying Factor Levels ---- <svg viewBox="0 0 581 512" style="position:relative;display:inline-block;top:.1em;fill:white;height:3em;" xmlns="http://www.w3.org/2000/svg"> <path d="M581 
226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> --- .center[ # Modifying Factor Levels ] - Previous `fct_*` functions allowed you to re-order factor levels - These next functions will enable you to change the values of factor levels - `fct_recode`: allows you to change the value of each level - `fct_collapse`: allows you to combine factor levels - `fct_lump`: allows you to lump together small groups - `fct_drop`: allows you to remove factor levels that do not have any members --- .center[ # fct_recode ] - `fct_recode`: allows you to change the value of each level - "new name" = "old name" .panelset[ .panel[.panel-name[Original Data] .pull-left[ ```r gss_cat %>% count(partyid) %>% head(n = 7) ``` ] .pull-right[ |partyid | n| |:------------------|----:| |No answer | 154| |Don't know | 1| |Other party | 393| |Strong republican | 2314| |Not str republican | 3032| |Ind,near rep | 1791| |Independent | 4119| ] ]<!----> .panel[.panel-name[Recode] .pull-left[ ```r gss_cat %>% mutate(partyid = fct_recode(partyid, "Republican, strong" = "Strong republican", "Republican, weak" = "Not str republican", "Independent, near rep" = "Ind,near rep", "Independent, near dem" = "Ind,near dem", "Democrat, weak" = "Not str democrat", "Democrat, strong" = "Strong democrat" )) %>% count(partyid) %>% head(n = 7) ``` ] .pull-right[ |partyid | n| |:---------------------|----:| |No answer | 154| |Don't know | 1| |Other party | 393| |Republican, strong | 2314| |Republican, weak | 3032| 
|Independent, near rep | 1791| |Independent | 4119| ] ]<!----> ]<!--end panelset--> --- .center[ # Combine Levels with fct_recode ] - To combine groups, you can assign multiple old levels to the same new level .pull-left[ ```r gss_cat %>% mutate(partyid = fct_recode(partyid, "Republican, strong" = "Strong republican", "Republican, weak" = "Not str republican", "Independent, near rep" = "Ind,near rep", "Independent, near dem" = "Ind,near dem", "Democrat, weak" = "Not str democrat", "Democrat, strong" = "Strong democrat", "Other" = "No answer", "Other" = "Don't know", "Other" = "Other party" )) %>% count(partyid) ``` ] .pull-right[ |partyid | n| |:---------------------|----:| |Other | 548| |Republican, strong | 2314| |Republican, weak | 3032| |Independent, near rep | 1791| |Independent | 4119| |Independent, near dem | 2499| |Democrat, weak | 3690| |Democrat, strong | 3490| ] --- .center[ # fct_collapse ] - `fct_collapse`: allows you to combine factor levels - useful when you want to collapse a lot of levels - "new level" = c("old level 1", "old level 2", "old level 3") .pull-left[ ```r gss_cat %>% mutate(partyid = fct_collapse(partyid, other = c("No answer", "Don't know", "Other party"), rep = c("Strong republican", "Not str republican"), ind = c("Ind,near rep", "Independent", "Ind,near dem"), dem = c("Not str democrat", "Strong democrat") )) %>% count(partyid) ``` ] .pull-right[ |partyid | n| |:-------|----:| |other | 548| |rep | 5346| |ind | 8409| |dem | 7180| ] --- .center[ # fct_lump ] - `fct_lump` combines the smallest levels into a single "Other" level - by default, it keeps lumping together the smallest groups for as long as the aggregate "Other" group remains the smallest group - you can use the `n` parameter to specify how many groups (excluding "Other") you want to keep .panelset[ .panel[.panel-name[Default] .pull-left[ ```r gss_cat %>% mutate(relig = fct_lump(relig)) %>% count(relig) ``` ] .pull-right[ |relig | n| |:----------|-----:| |Protestant | 10846| |Other | 10637| ] 
]<!----> .panel[.panel-name[Use n] .pull-left[ ```r gss_cat %>% mutate(relig = fct_lump(relig, n = 4)) %>% count(relig, sort = TRUE) ``` ] .pull-right[ |relig | n| |:----------|-----:| |Protestant | 10846| |Catholic | 5124| |None | 3523| |Other | 1301| |Christian | 689| ] ]<!----> ]<!--end panelset--> --- .center[ # fct_drop ] - `fct_drop`: allows you to remove factor levels that do not have any members ```r f <- factor(c("a", "b"), levels = c("a", "b", "c")) f ``` ``` ## [1] a b ## Levels: a b c ``` ```r fct_drop(f) ``` ``` ## [1] a b ## Levels: a b ``` --- .center[ # Now You Try! ] - Change the factor levels of the `gss_cat` variable `marital` to the following three categories: - No answer - Unmarried (a combination of Never married, Separated, Divorced, and Widowed) - Married - Print out the frequency count of these three categories - Now, perform the same task using a different `fct_*` function --- .center[ # Solution ] **Solution 1** ```r gss_cat %>% mutate(marital = fct_recode(marital, "Unmarried" = "Never married", "Unmarried" = "Separated", "Unmarried" = "Divorced", "Unmarried" = "Widowed")) %>% count(marital) ``` **Solution 2** ```r gss_cat %>% mutate(marital = fct_collapse(marital, "Unmarried" = c("Never married", "Separated", "Divorced", "Widowed"))) %>% count(marital) ``` --- class: inverse, center, middle # Recap! 
---- <svg viewBox="0 0 581 512" style="position:relative;display:inline-block;top:.1em;fill:white;height:3em;" xmlns="http://www.w3.org/2000/svg"> <path d="M581 226.6C581 119.1 450.9 32 290.5 32S0 119.1 0 226.6C0 322.4 103.3 402 239.4 418.1V480h99.1v-61.5c24.3-2.7 47.6-7.4 69.4-13.9L448 480h112l-67.4-113.7c54.5-35.4 88.4-84.9 88.4-139.7zm-466.8 14.5c0-73.5 98.9-133 220.8-133s211.9 40.7 211.9 133c0 50.1-26.5 85-70.3 106.4-2.4-1.6-4.7-2.9-6.4-3.7-10.2-5.2-27.8-10.5-27.8-10.5s86.6-6.4 86.6-92.7-90.6-87.9-90.6-87.9h-199V361c-74.1-21.5-125.2-67.1-125.2-119.9zm225.1 38.3v-55.6c57.8 0 87.8-6.8 87.8 27.3 0 36.5-38.2 28.3-87.8 28.3zm-.9 72.5H365c10.8 0 18.9 11.7 24 19.2-16.1 1.9-33 2.8-50.6 2.9v-22.1z"></path></svg> --- .center[ # Recap! ] .pull-left[ **Modifying Factor Order** - `fct_reorder` <br> <br> - `fct_relevel` <br> <br> - `fct_reorder2` <br> <br> - `fct_infreq` - `fct_rev` <br> <br> ] .pull-right[ **Modifying Factor Levels** - `fct_recode` <br> <br> - `fct_collapse` <br> <br> - `fct_lump` <br> <br> - `fct_drop` <br> <br> ]
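
---

.center[
# Recap in Code
]

- One possible way to see the recap functions side by side, sketched on a hypothetical toy factor `f` (not part of the module's data; all object names here are made up for illustration)
- Level `"d"` is deliberately left unused so that `fct_drop` has something to remove

```r
library(forcats)

# Hypothetical toy factor: "c" appears 3 times, "b" twice, "a" once, "d" never
f <- factor(c("b", "c", "c", "a", "c", "b"), levels = c("a", "b", "c", "d"))

# Hypothetical factor and numeric vector for fct_reorder
g <- factor(c("x", "y", "z"))
score <- c(3, 1, 2)

# --- Modifying factor order ---
by_value <- fct_reorder(g, score)   # order levels by the values in score
by_freq  <- fct_infreq(f)           # most frequent level first
reversed <- fct_rev(f)              # reverse the current level order
c_first  <- fct_relevel(f, "c")     # move "c" to the front

# --- Modifying factor levels ---
renamed   <- fct_recode(f, "apple" = "a")        # "new name" = "old name"
collapsed <- fct_collapse(f, ab = c("a", "b"))   # merge "a" and "b" into "ab"
dropped   <- fct_drop(f)                         # remove the unused level "d"

# Compare the level sets before and after
levels(f)
levels(by_freq)
levels(dropped)
```

- Calling `levels()` on each result is a quick check of what changed: the order functions keep the same set of levels, while the level functions rename, merge, or remove them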